Credit Card Users Churn Prediction

Problem Statement

Business Context

Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the different kinds of fees they carry, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving the credit card service would lead to a loss for the bank, so the bank wants to analyze its customer data to identify the customers who are likely to leave and the reasons why, so that it can improve in those areas.

As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.

Data Description

What Is a Revolving Balance?

The revolving balance is the portion of credit card spending that is not paid off at the end of a billing cycle and is carried over to the next one, accruing interest.

What is the Average Open To Buy?

Open to buy is the amount of credit currently available on the card, i.e., the credit limit minus the outstanding balance; averaged over the last 12 months, it gives Avg_Open_To_Buy.

What is the Average Utilization Ratio?

The utilization ratio is the share of the credit limit currently in use (revolving balance divided by credit limit); averaged over the last 12 months, it gives Avg_Utilization_Ratio.

Relation between Avg_Open_To_Buy, Credit_Limit and Avg_Utilization_Ratio:

Avg_Open_To_Buy = Credit_Limit - Total_Revolving_Bal, and Avg_Utilization_Ratio = Total_Revolving_Bal / Credit_Limit, so the three quantities are directly linked.
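The relationship between these columns can be checked directly on the data. A minimal sketch with hypothetical values (the real dataset should follow the same arithmetic, which is worth verifying):

```python
import pandas as pd

# Hypothetical rows; run the same check on the real dataset to confirm
# that Avg_Open_To_Buy is Credit_Limit - Total_Revolving_Bal and
# Avg_Utilization_Ratio is Total_Revolving_Bal / Credit_Limit.
df = pd.DataFrame({
    "Credit_Limit": [12691.0, 8256.0],
    "Total_Revolving_Bal": [777, 864],
})
df["Avg_Open_To_Buy"] = df["Credit_Limit"] - df["Total_Revolving_Bal"]
df["Avg_Utilization_Ratio"] = df["Total_Revolving_Bal"] / df["Credit_Limit"]
print(df)
```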

Please read the instructions carefully before starting the project.

This is a commented Jupyter (IPython) notebook file in which all the instructions and the tasks to be performed are mentioned.

Importing necessary libraries

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
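A typical import cell for this kind of project looks like the sketch below; the exact set may differ (matplotlib and seaborn for the EDA plots, plus xgboost and imbalanced-learn for SMOTE, are commonly imported as well):

```python
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    BaggingClassifier,
    RandomForestClassifier,
    AdaBoostClassifier,
    GradientBoostingClassifier,
)
from sklearn.metrics import recall_score, confusion_matrix
```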

Loading the dataset

Data Overview

View the first and last 5 rows of the dataset.

Understand the shape of the dataset.

There are 10127 rows and 21 columns.
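The loading and overview steps can be sketched as follows; the file name is an assumption (substitute the actual path), and a tiny stand-in frame is used here so the snippet runs on its own:

```python
import pandas as pd

# With the real file: df = pd.read_csv("BankChurners.csv")  # hypothetical name
# Tiny stand-in frame so the snippet is self-contained:
df = pd.DataFrame({
    "CLIENTNUM": [768805383, 818770008],
    "Customer_Age": [45, 49],
})
print(df.head())   # first rows
print(df.tail())   # last rows
print(df.shape)    # (rows, columns); the full dataset is (10127, 21)
print(df.dtypes)   # data types of the columns
```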

Check the data types of the columns for the dataset.

Statistical summary of the dataset.

Checking for duplicate values

There are no duplicates in the data.

Checking for missing values
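This check, together with the duplicate check above, can be sketched with pandas (stand-in frame; run the same lines on the real data):

```python
import pandas as pd

# Stand-in frame: the third row duplicates the first.
df = pd.DataFrame({
    "Customer_Age": [45, 49, 45],
    "Gender": ["M", "F", "M"],
})
print(df.duplicated().sum())  # number of fully duplicated rows
print(df.isnull().sum())      # missing values per column
```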

Take a look at the statistics for the object columns

Get value counts of the categorical data to see what they contain

Delete the CLIENTNUM column, as it is just a unique customer identifier with no predictive value

For analysis, encode Existing and Attrited customers as 0 and 1, respectively
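The two steps above (dropping the identifier and encoding the target) can be sketched as, again on a stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    "CLIENTNUM": [768805383, 818770008],
    "Attrition_Flag": ["Existing Customer", "Attrited Customer"],
})
df = df.drop(columns=["CLIENTNUM"])  # identifier only, no predictive value
df["Attrition_Flag"] = df["Attrition_Flag"].replace(
    {"Existing Customer": 0, "Attrited Customer": 1}
)
print(df)
```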

Exploratory Data Analysis (EDA)

The functions below are defined to carry out the exploratory data analysis.

Functions

Univariate analysis

Customer_Age

Months_on_book

Credit_Limit

Total_Revolving_Bal

Avg_Open_To_Buy

Total_Trans_Ct

Total_Amt_Chng_Q4_Q1

Total_Trans_Amt

Total_Ct_Chng_Q4_Q1

Avg_Utilization_Ratio

Numerical columns that are better represented using bar/count plots

Dependent_count

Total_Relationship_Count

Months_Inactive_12_mon

Contacts_Count_12_mon

Categorical variables

Gender

Education_Level

Marital_Status

Income_Category

Card_Category

Attrition_Flag

Bivariate Distributions

Check for attributes that have a strong correlation with each other

Correlation Check
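A sketch of the correlation check on synthetic columns. Because Avg_Open_To_Buy is derived from Credit_Limit and Total_Revolving_Bal, a near-perfect correlation between it and Credit_Limit is to be expected:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
credit_limit = rng.uniform(1500, 35000, size=100)
revolving = rng.uniform(0, 2500, size=100)
df = pd.DataFrame({
    "Credit_Limit": credit_limit,
    "Total_Revolving_Bal": revolving,
    "Avg_Open_To_Buy": credit_limit - revolving,  # derived column
})
corr = df.corr()
print(corr.round(2))
# A heatmap makes this easier to scan, e.g.:
# import seaborn as sns; sns.heatmap(corr, annot=True, fmt=".2f")
```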

Attrition_Flag vs Gender

Attrition_Flag vs Marital_Status

Attrition_Flag vs Education_Level

Customers with Doctorate degrees attrited the most, at ~21%, followed by customers with Post-Graduate degrees at ~18%. Customers with College, Graduate, or High School degrees and Uneducated customers attrited at ~15-16%.

Attrition_Flag vs Income_Category

Customers making over 120K dollars, customers making less than 40K dollars, and those with an unknown income attrited the most, at ~17%. Customers making between 60K and 80K dollars attrited the least, at ~13.5%, while those making between 80K and 120K dollars had ~16% attrition.

Attrition_Flag vs Contacts_Count_12_mon

Attrition_Flag vs Months_Inactive_12_mon

Attrition_Flag vs Total_Relationship_Count

Attrition_Flag vs Dependent_count

Attrition_Flag vs Credit_Limit

Attrition_Flag vs Customer_Age

Attrition_Flag vs Total_Trans_Ct

Attrition_Flag vs Total_Trans_Amt

Attrition_Flag vs Total_Ct_Chng_Q4_Q1

Attrition_Flag vs Total_Amt_Chng_Q4_Q1

Attrition_Flag vs Avg_Utilization_Ratio

Attrition_Flag vs Months_on_book

Attrition_Flag vs Total_Revolving_Bal

Attrition_Flag vs Avg_Open_To_Buy

Questions:

  1. How is the total transaction amount distributed?
  2. What is the distribution of the level of education of customers?
  3. What is the distribution of the level of income of customers?
  4. How does the change in transaction amount between Q4 and Q1 (Total_Amt_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
  5. How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
  6. What are the attributes that have a strong correlation with each other?

Data Pre-processing

Outlier Detection
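A minimal sketch of IQR-based outlier detection, the usual approach at this step (whether to treat the flagged values is a separate judgment call):

```python
import pandas as pd

s = pd.Series([3, 4, 5, 6, 7, 50])  # 50 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # -> [50]
```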

Train-Test Split
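A sketch of a stratified split on stand-in data; stratifying on the target keeps the attrition ratio the same in both splits (the 70/30 split and the seed are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.DataFrame({"Total_Trans_Ct": np.arange(100)})
y = pd.Series([0] * 84 + [1] * 16)  # ~16% positives, similar to the real data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(len(X_train), len(X_test))
print(round(y_train.mean(), 2), round(y_test.mean(), 2))
```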

Missing value imputation

Check the class balance in the whole, train, and test data sets

Checking that no column has missing values in train or test sets

Some more sanity checks

Encoding categorical variables
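One-hot encoding with pandas is a minimal way to do this step; drop_first avoids redundant columns (the frame below is a stand-in):

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F"], "Card_Category": ["Blue", "Silver"]})
encoded = pd.get_dummies(df, columns=["Gender", "Card_Category"], drop_first=True)
print(encoded.columns.tolist())
```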

Model Building

Model evaluation criterion

Which metric to optimize?

My Model Evaluation Criterion

Functions

Let's define a function that outputs different metrics (including recall) on the train and test sets, and a function that shows the confusion matrix, so that we do not have to repeat the same code while evaluating models.
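A sketch of such a helper (the function name and the exact metric set are assumptions); it is exercised on a trivially fitted tree so the snippet runs on its own:

```python
import pandas as pd
from sklearn.metrics import (
    accuracy_score, recall_score, precision_score, f1_score, confusion_matrix,
)
from sklearn.tree import DecisionTreeClassifier

def model_performance(model, X, y):
    """Return accuracy, recall, precision, and F1 for a fitted model."""
    pred = model.predict(X)
    return pd.DataFrame({
        "Accuracy": [accuracy_score(y, pred)],
        "Recall": [recall_score(y, pred)],
        "Precision": [precision_score(y, pred)],
        "F1": [f1_score(y, pred)],
    })

# Quick check on toy data.
X = pd.DataFrame({"x": [0, 1, 2, 3]})
y = pd.Series([0, 0, 1, 1])
clf = DecisionTreeClassifier(random_state=1).fit(X, y)
print(model_performance(clf, X, y))
print(confusion_matrix(y, clf.predict(X)))
```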

NOTE:


I am going to use a two-pronged approach to build and evaluate models:
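Rebalancing is applied to the training data only. Oversampling is typically done with SMOTE from imbalanced-learn; random undersampling can also be sketched with plain pandas, as below (column names are stand-ins):

```python
import pandas as pd

# Oversampling (typical approach, not run here):
#   from imblearn.over_sampling import SMOTE
#   X_over, y_over = SMOTE(random_state=1).fit_resample(X_train, y_train)

# Random undersampling sketched with pandas: sample the majority class
# down to the size of the minority class.
train = pd.DataFrame({
    "Total_Trans_Ct": range(10),
    "Attrition_Flag": [0] * 8 + [1] * 2,
})
minority = train[train["Attrition_Flag"] == 1]
majority = train[train["Attrition_Flag"] == 0].sample(len(minority), random_state=1)
under = pd.concat([majority, minority])
print(under["Attrition_Flag"].value_counts())
```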

Model Building with original data

Decision Tree

Bagging

Random Forest

AdaBoost

Gradient Boosting

XGBoost

Model Building with Oversampled data

Decision Tree

Bagging

Random Forest

AdaBoost

Gradient Boosting

XGBoost

Model Building with Undersampled data

Decision Tree

Bagging

Random Forest

AdaBoost

Gradient Boosting

XGBoost

Hyperparameter Tuning

Sample Parameter Grids

Note

  1. Sample parameter grids have been provided for the necessary hyperparameter tuning. These grids are meant to balance model-performance improvement against execution time; extend or reduce them based on execution time and system configuration.
    • Note that extending a parameter grid to improve model performance further will increase the execution time.
# Gradient Boosting
param_grid = {
    "init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50, 110, 25),
    "learning_rate": [0.01, 0.1, 0.05],
    "subsample": [0.7, 0.9],
    "max_features": [0.5, 0.7, 1],
}

# AdaBoost
param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "learning_rate": [0.01, 0.1, 0.05],
    "base_estimator": [  # renamed to "estimator" in scikit-learn >= 1.2
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Bagging
param_grid = {
    "max_samples": [0.8, 0.9, 1],
    "max_features": [0.7, 0.8, 0.9],
    "n_estimators": [30, 50, 70],
}

# Random Forest
param_grid = {
    "n_estimators": np.arange(50, 110, 25),  # was [50, 110, 25], a typo for the arange call
    "min_samples_leaf": np.arange(1, 4),
    "max_features": list(np.arange(0.3, 0.6, 0.1)) + ["sqrt"],  # flat list, not a nested array
    "max_samples": np.arange(0.4, 0.7, 0.1),
}

# Decision Tree
param_grid = {
    "max_depth": np.arange(2, 6),
    "min_samples_leaf": [1, 4, 7],
    "max_leaf_nodes": [10, 15],
    "min_impurity_decrease": [0.0001, 0.001],
}

# XGBoost
param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "scale_pos_weight": [1, 2, 5],  # weights the minority (attrited) class
    "learning_rate": [0.01, 0.1, 0.05],
    "gamma": [1, 3],
    "subsample": [0.7, 0.9],
}
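Each grid is passed to GridSearchCV together with the matching estimator. A runnable sketch with Gradient Boosting on a small synthetic problem, using recall as the scoring metric to match the evaluation criterion (the data here is synthetic, not the bank data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the training split (~16% positives).
X, y = make_classification(n_samples=200, weights=[0.84], random_state=1)

param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "learning_rate": [0.01, 0.1],
}
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_grid,
    scoring="recall",  # recall on the attrited class is what we optimize
    cv=3,
    n_jobs=-1,
)
grid.fit(X, y)
print(grid.best_params_)
```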

Tuning XGBoost model with original data

Tuning Gradient Boosting model with original data

Tune XGBoost with oversampled data

Tuning XGBoost with undersampled data

Tuning Gradient Boosting with undersampled data

Model Comparison and Final Model Selection

Test set final performance

I will run the two models I chose for academic and production purposes on the test data.

Tuned XGBoost model with undersampled data

Tuned XGBoost model with original data

Feature importance of the tuned XGBoost model with undersampled data

Feature importance of the tuned XGBoost model with original data

Business Insights and Conclusions


NOTE:

Stacking Model
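A minimal stacking sketch with scikit-learn's StackingClassifier; the base estimators and meta-estimator shown here are illustrative choices, not necessarily the ones used in the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier, RandomForestClassifier, StackingClassifier,
)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=1)  # synthetic stand-in

# The base estimators' cross-validated predictions feed a final meta-estimator.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=1)),
        ("ada", AdaBoostClassifier(random_state=1)),
    ],
    final_estimator=LogisticRegression(),
    cv=3,
)
stack.fit(X, y)
print(round(stack.score(X, y), 2))
```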

Feature Importance
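Tree ensembles expose feature_importances_ after fitting; a sketch on synthetic data (the feature names are hypothetical):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=200, n_features=5, n_informative=2, random_state=1
)
cols = [f"feature_{i}" for i in range(5)]  # hypothetical names
model = RandomForestClassifier(random_state=1).fit(X, y)
importances = pd.Series(model.feature_importances_, index=cols).sort_values(
    ascending=False
)
print(importances)
# importances.plot(kind="barh") gives the usual bar chart.
```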